Introduction

Revolutionary advances in computer graphics technologies, driven by the needs of 3D gaming,
have resulted in specialized SIMD floating point rendering engines known as GPUs. These
GPUs are programmed via graphics libraries such as OpenGL, but have very general
programming architectures. These cards are handily exceeding Moore’s law performance
predictions and are expected to continue to do so for some time. The size and cost-competitive
nature of the gaming industry combine to make these systems extremely affordable. Today,
GPUs with over 40GF can be bought for around $300 and they are expected to increase to
around 1000GF for about that same cost by the 2005 timeframe. These systems form the core of
distributed interactive systems but can also be applied to many processes other than rendering
and visualization. At present, the non-visualization uses of these systems have been limited to
classically streaming or vector floating point bound processes.

We will present early results in the use of these GPU systems to perform computations on
alternative types of algorithms that are not traditionally FLOP bound, such as those utilized in
video image processing, text processing and semantic graph traversal and analysis. Knowledge
discovery based application areas should, minimally, benefit from the extreme memory
bandwidths present on GPU systems (over 23GB/sec in current systems), and are in a position to
exploit the FLOP rich GPU environment to enhance the fidelity and complexity of their
computation. Some of our early studies have already shown orders of magnitude performance
speedup for specific applications.

The  GPU  Based  Compute  Platform 

Two advantages of GPUs are their extremely high memory bandwidth and their unique gather
capabilities. We are investigating applications that exploit both of these features. As a key first
step we are investigating the mapping of data onto the current GPU architectures via pointer-less
indirection techniques and implicit parallel storage techniques. The design is expected to draw
from recent work on tiled, paged boundary conditions on GPU systems. The initial targets are
temporal image processing algorithms commonly used in the processing of data like surveillance
video and facial biometrics. Basic algorithms for filtering and feature detection and tracking are
being implemented and demonstrated to apply to large, parallel data streams.

One of the difficulties in the scaling and parallelization of algorithms on GPU systems stems
from the very nature of the data structures. Following the image processing work, we will focus on
non-traditional HPC data structures. In the next several months we expect to investigate the
applicability of GPU systems to string and list processing functions. These have been difficult to
map onto streaming processing systems, but recent advances in pixel shader technology
suggest that it may be possible to perform hundreds of text searches in parallel in a
streaming, multi-pass GPU architecture. We intend to exploit similar advances in texture
fetching operations to investigate the use of GPUs in pointer-less list searching and comparison
problems. These research advances hold the potential of allowing these systems to be applied to
other data mining problems and the processing of transaction-oriented data, such as the analysis
of web traffic or semantic graphs.
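
As an illustration of what pointer-less list searching could look like, the following is a minimal sketch in Python rather than shader code. It assumes (this layout is not taken from the work described here) that a linked list is stored as two flat arrays, one of values and one of next indices, so that following a link is an indexed fetch, loosely analogous to a dependent texture fetch. The names `values`, `next_index`, `END`, and `find_in_list` are hypothetical.

```python
END = -1  # sentinel index marking the end of a list (an assumption of this sketch)

def find_in_list(values, next_index, head, target):
    """Walk an index-linked list; return the slot holding target, or END."""
    slot = head
    steps = 0
    # The step cap guards against accidental cycles in next_index.
    while slot != END and steps < len(values):
        if values[slot] == target:
            return slot
        slot = next_index[slot]  # indirection by array index, not by pointer
        steps += 1
    return END

values = ["gpu", "cpu", "apu", "fpu"]
next_index = [2, END, 1, 0]  # list order from slot 3: fpu, gpu, apu, cpu
found = find_in_list(values, next_index, head=3, target="apu")
missing = find_in_list(values, next_index, head=3, target="dsp")
```

Because the links are plain indices, many such walks over the same arrays are independent of one another, which is the property a streaming, multi-pass formulation would rely on.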

The GPU enhanced system is not a static target, and tremendous advances are announced on
nearly a bi-annual basis by vendors. Additionally, it will be useful to compare results from these
GPU based systems with the results from other architectures that are being developed in parallel
(e.g. BlueGene/L and Merrimac). In parallel with the basic algorithmic efforts, we are
performing research into the integration of this work with higher level semantic languages with
multiple system targets. The integration of this work, in particular the non-traditional data structures
for strings and lists, into streaming languages such as Brook will allow the work to target a
number of other real or simulated architectures. As a result, virtual performance comparisons can
be made with these architectures. As efforts in this space progress, the model will be adapted to
next generation graphics architectures, such as the
proposed “Cell” based systems.
Trends  in  the  graphics  marketplace


•  Inherent  parallelism  of  graphics  tasks 

•  Performance  increasing  faster  than  for  CPUs 

•  Move  to  programmable  hardware 

•  Effects of mass markets


■  Not expected to end anytime soon...

•  Today: 40GF, 2GB/s I/O, 30GB/s memory

•  2006: 100GF, 8GB/s I/O, 60GB/s memory

•  2007: 1TF


The NV40 and the Sony PlayStation 3


Are graphics trends a glimpse of the
future?

The  nVidia  NV40  Architecture 

•  256MB RAM

•  128 32-bit IEEE FP units @ 400MHz

•  220M transistors, 110W of power

The PlayStation 3 (patent application)

•  Core component is a cell

■  1 "PowerPC" CPU + 8 APUs ("vectorial" processors)

■  4GHz, 128K RAM, 256GFLOP/cell

•  Multiple cells (Phone, PDA, PS3, ...)

■  Four cell architecture (1TFLOP)

■  Central 64MB memory

Keys 

•  Streaming  data  models 


[Figure: nVidia NV30. Labels: 128 bit DDR II Local Memory; Cache controller & 5 channels Crossbar & 4 channel Memory Controller; Texture and Geometry Caches]


Cache-driven/cache-oblivious  computing 


[Figure: nVidia NV40]


Programmable  FP  SIMD  engines, 
40-100GF  today,  1TF  by  '06 

Where  can  they  be  exploited? 

•  Many  advantages  for  the  data  pipeline 

•  Data/algorithmic design challenges

•  Possible applicability for simulation

•  Many current research projects
on scientific computing,
databases, audio processing


Current projects

•  Programmable rendering pipeline

■  Multi-variate, interactive

■  Increased graphics precision

•  Image composition pipeline

•  Implementation of physics based rendering

■  Simulated radiography, diffraction computation

•  Large image geo-registration

■  100x performance improvement over CPU

[Figure labels: Volume A, Volume B, GPU, Vertex Program, Fragment Program]

■  Investigate  use  of  COTS  technologies  for  computation

•  "Non-traditional"  applications 

■  Image  and  speech 

■  String,  statistical,  graph... 

•  Mechanisms  necessary  for  exploitation 

■  Data infrastructure (e.g. cache coherent streaming...)

■  Software abstractions

•  Delineate some boundary conditions on their use

■  Evaluation vs CPU based solutions
■  Parameter-space investigation


Forms  the  basis  of  a  comparative 
framework 


•  Support  both  GPU  and  CPU  algorithmic 
implementations 

•  Targets multiple platforms

•  Provides data abstraction


■  "Tile-based" streaming
■  Cache coherency control
■  CPU to GPU to CPU glue layer

•  Utilizes higher-level languages for algorithms

■  Cg, Brook, GLSL, etc.
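
The slide does not say how tile-based streaming is organized, so the following is only a sketch of one common arrangement: cut the image into fixed-size tiles and give each tile a clamped halo of neighboring pixels, so a filter can process tiles independently. The function name `iter_tiles` and its parameters `tile` and `halo` are hypothetical.

```python
def iter_tiles(height, width, tile, halo):
    """Yield (inner, padded) rectangles as (row0, row1, col0, col1), end-exclusive."""
    for r0 in range(0, height, tile):
        for c0 in range(0, width, tile):
            r1 = min(r0 + tile, height)
            c1 = min(c0 + tile, width)
            inner = (r0, r1, c0, c1)
            # The halo supplies the neighbors a filter needs at tile edges,
            # clamped so it never reaches outside the image.
            padded = (max(r0 - halo, 0), min(r1 + halo, height),
                      max(c0 - halo, 0), min(c1 + halo, width))
            yield inner, padded

tiles = list(iter_tiles(height=5, width=7, tile=4, halo=1))
# Inner rectangles partition the image, so their areas sum to height * width.
covered = sum((r1 - r0) * (c1 - c0) for (r0, r1, c0, c1), _ in tiles)
```

In a CPU to GPU to CPU glue layer of this kind, the padded rectangle would be what gets downloaded and the inner rectangle what gets read back; that split is an assumption of the sketch, not something the slide states.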
Common  attributes


•  Large,  streaming  imagery  on  a  single  gfx  card 

•  Parallel 1D and 2D applications

•  Multi-spectral (four, possibly temporal
channels)

■  Discrete convolution

•  Arbitrary kernels

■  Correlation

•  Separate threshold, search, and detection
phase  included 
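
For reference, the convolution and threshold phases listed above can be sketched on the CPU as follows. This is a plain Python illustration, not the GPU implementation; clamp-to-edge addressing is assumed because it mirrors a common texture addressing mode, and the names `convolve2d` and `threshold` are hypothetical.

```python
def convolve2d(image, kernel):
    """Discrete 2D convolution with clamp-to-edge addressing."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ry, rx = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    # Convolution flips the kernel; correlation would use
                    # y - ry + j and x - rx + i instead.
                    sy = min(max(y + ry - j, 0), h - 1)
                    sx = min(max(x + rx - i, 0), w - 1)
                    acc += kernel[j][i] * image[sy][sx]
            out[y][x] = acc
    return out

def threshold(image, level):
    """Return (row, col) positions whose value exceeds level."""
    return [(y, x) for y, row in enumerate(image)
            for x, value in enumerate(row) if value > level]

image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 9.0                      # a single bright pixel
box = [[1.0 / 9.0] * 3 for _ in range(3)]
smoothed = convolve2d(image, box)
hits = threshold(smoothed, 0.5)
```

Every output pixel depends only on a fixed neighborhood of the input, which is why this kind of kernel maps naturally onto a fragment program.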
Representation and bandwidth characteristics

■  String comparison

•  "Bulk" comparison operations, individual outputs

■  String sorting

•  Based on string comparison

•  Batched sort based on radix algorithms

■  String  searching 

•  "Wildcard"  pattern  matching 

•  Sort-based  element  search 
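
A small CPU sketch of the string operations listed above, under an assumed representation: each string is packed into a fixed-width, zero-padded row of bytes (much as it might occupy one row of a texture), sorted with a least-significant-byte-first radix sort, and compared in bulk against a key. The names `pack`, `radix_sort`, and `bulk_equal` are hypothetical, and strings longer than the chosen width are truncated.

```python
def pack(strings, width):
    """Encode strings as fixed-width, zero-padded byte rows (truncating if longer)."""
    return [s.encode("ascii")[:width].ljust(width, b"\0") for s in strings]

def radix_sort(rows, width):
    """LSD radix sort: one stable bucket pass per byte column, last column first."""
    for col in range(width - 1, -1, -1):
        buckets = [[] for _ in range(256)]
        for row in rows:
            buckets[row[col]].append(row)
        rows = [row for bucket in buckets for row in bucket]
    return rows

def bulk_equal(rows, key):
    """Compare every row against one key; one independent output per row."""
    return [row == key for row in rows]

words = ["shader", "cache", "texture", "brook", "cg"]
rows = pack(words, width=8)
ordered = [r.rstrip(b"\0").decode("ascii") for r in radix_sort(rows, 8)]
matches = bulk_equal(rows, pack(["brook"], 8)[0])
```

Each radix pass touches every row once and each bulk comparison is independent of the others, so both have the regular, batched access pattern that a multi-pass streaming architecture favors.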
■  Image  transforms

•  FFT,  Wavelet 

•  Many  application  domains 

■  Statistical functions on images

•  Moments, regression (general linear model)

•  Hypothesis/model driven image processing, texture
characterization, etc.

•  Hidden Markov Models

■  Graph search

•  Structured (fully connected) or unstructured graphs,
detect and return lowest cost path

•  Many  application  domains 
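
The slides do not name a graph search algorithm, so the following is only one possible sketch of returning the lowest cost path: repeated relaxation passes over a dense cost matrix (Bellman-Ford style), where each pass updates every node independently from the previous pass's distances. It assumes positive edge costs, with `INF` marking a missing edge; `lowest_cost_path` is a hypothetical name.

```python
INF = float("inf")

def lowest_cost_path(cost, source, target):
    """Return (total_cost, path) using repeated relaxation passes."""
    n = len(cost)
    dist = [INF] * n
    prev = [-1] * n
    dist[source] = 0.0
    for _ in range(n - 1):              # at most n - 1 passes are needed
        new_dist, new_prev = dist[:], prev[:]
        for v in range(n):              # each v is independent: a gather over u
            for u in range(n):
                d = dist[u] + cost[u][v]
                if d < new_dist[v]:
                    new_dist[v], new_prev[v] = d, u
        if new_dist == dist:            # converged early
            break
        dist, prev = new_dist, new_prev
    if dist[target] == INF:
        return INF, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

cost = [
    [INF, 4.0, 1.0, INF],
    [INF, INF, INF, 1.0],
    [INF, 2.0, INF, 5.0],
    [INF, INF, INF, INF],
]
total, path = lowest_cost_path(cost, source=0, target=3)
```

Reading only the previous pass's distances makes each pass a pure gather, which is the access pattern GPUs of this generation support; a work-efficient CPU algorithm such as Dijkstra's would be structured quite differently.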


Constrained system targets based on resource
limits

Hardware  targets 

•  nVidia: NV3x, NV4x, NV5x

■  Focus on NV4x due to new branching capabilities

■  Dual GPU IA32 platform

■  PCI-Express (PCIe) enhanced readback and async
bandwidth

•  BG/L and Merrimac

OS targets

•  Primarily Linux, some Windows due to driver issues

Language  targets 

•  nVidia  Cg,  Brook 


All  timings  count  download, 
render,  and  readback 

First render pass is excluded
from the count


■  Overhead  to  load  shader  can  be 
substantial 
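
The timing rule above (count download, render, and readback; leave the first pass out) can be sketched as a small harness. This is an illustrative assumption about how such a measurement might be structured, not the benchmark code itself; `average_pass_time` and the three stage callables are hypothetical stand-ins.

```python
import time

def average_pass_time(download, render, readback, passes):
    """Return mean seconds per timed pass of download + render + readback."""
    def one_pass():
        download()
        render()
        readback()

    one_pass()  # first pass absorbs one-time costs such as shader load; not timed
    start = time.perf_counter()
    for _ in range(passes):
        one_pass()
    return (time.perf_counter() - start) / passes

# Stand-in stages that only count how often they run.
calls = {"download": 0, "render": 0, "readback": 0}

def make_stage(name):
    def stage():
        calls[name] += 1
    return stage

mean_seconds = average_pass_time(make_stage("download"), make_stage("render"),
                                 make_stage("readback"), passes=3)
```

With three timed passes, each stage runs four times in total: once in the untimed first pass and three times inside the timed loop.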
■  Software  vs.  two-texture  hardware  implementation

■  At  all  but  the  smallest  kernel  sizes,  GPUs  are  much  faster 

[Figure: CPU and GPU results, 512x512 images. Average render time (secs) versus filter size. Series: Software, Hardware 8-bit, Hardware 16-bit, Hardware 32-bit.]
■  Software  vs.  two-texture  hardware  implementation

■  32-bit  textures  use  more  memory  bandwidth 


[Figure: CPU and GPU Results, 9x9 Kernel. Series: Software, Hardware 8-bit, Hardware 16-bit, Hardware 32-bit.]
[Figure axis: Filter Size, 3 to 15 in odd steps]
Double  Precision


■  Port  of  David  Bailey's  single-double 
Fortran  library*  to  NVidia's  Cg  language 

■  Can  emulate  double  precision 

■  Use  two  single-precision  floats 

■  High order float is estimate to the double;
low order float is error of that estimate

■  Resulting precision is almost double

■  The exponent remains at single range


* available at http://crd.lbl.gov/~dhbailey/mpdist
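
To make the two-float representation concrete, here is a minimal Python sketch of the idea. It is not the Cg port or the Fortran library: single precision is simulated by rounding through `struct`, and the addition follows the usual error-compensated pattern for a (high, low) pair. The names `f32`, `ds_split`, `ds_add`, and `ds_value` are hypothetical.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def ds_split(x):
    """Split x into a high float (the estimate) and a low float (its error)."""
    hi = f32(x)
    lo = f32(x - hi)
    return hi, lo

def ds_add(a, b):
    """Add two (hi, lo) pairs using only single-precision operations."""
    t1 = f32(a[0] + b[0])
    e = f32(t1 - a[0])
    # Recover the rounding error of t1, then fold in both low-order parts.
    t2 = f32(f32(b[0] - e) + f32(a[0] - f32(t1 - e)))
    t2 = f32(f32(t2 + a[1]) + b[1])
    hi = f32(t1 + t2)
    lo = f32(t2 - f32(hi - t1))
    return hi, lo

def ds_value(a):
    """Combine a pair back into one Python float for inspection."""
    return a[0] + a[1]

x = 1.0 + 2.0 ** -30          # not representable in a single float
y = 2.0 ** -30
pair_sum = ds_add(ds_split(x), ds_split(y))
single_sum = f32(f32(x) + f32(y))
```

In this example the pair keeps the small terms that a lone single-precision float rounds away, while the exponent range is still that of single precision, as the slide notes.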


Convolution  with  single  and  emulated-double  arithmetic 

Double  precision  only  1.5x  slower  than  single  precision 
at  the  same  texture  depth 


[Figure: One Convolution Pass, Single vs Double Precision. 32-bit texture size; series for 1, 2, and 4 components; groups: Single, Double.]
Obtain results for a variety of algorithms
including strings, HMMs, and FFTs

Include performance and accuracy

Extend to new architectures as available
(e.g. Merrimac)

Explore other higher-level languages (e.g.
Brook implementations and other
streaming languages)

Launch a benchmarking web site:
http://www.llnl.gov/gaia


